Analyzing big stream data to extract valuable information is a significant
task. Several conventional methods have been designed to analyze big stream
data, but scheduling accuracy and time complexity remain significant issues.
To resolve them, an elastic-net kernelized multivariate discriminant MapReduce
classification (EKMDMC) technique is introduced, with the novelty of
elastic-net regularization-based feature selection and a kernelized
multivariate Fisher discriminant MapReduce classifier. Initially, the EKMDMC
technique performs feature selection using the elastic-net regularization
method to improve the prediction accuracy. The elastic-net regularization
method selects relevant features such as central processing unit (CPU) time,
memory, bandwidth, and energy based on a regression function. After the
relevant features are selected, the kernelized multivariate Fisher
discriminant MapReduce classifier is used to schedule the tasks to an
optimized processing unit. A kernel function is used to find the highest
similarity between the stream data tasks and the means of the available
classes. Experimental evaluation shows that the proposed EKMDMC technique
provides better performance in terms of resource aware predictive scheduling
efficiency, false positive rate, scheduling time, and memory consumption.


Keywords: 


Big stream data 

Elastic-net regularization 
Feature selection 

Kernelized multivariate Fisher discriminant

MapReduce classifier

Prediction accuracy 


This is an open access article under the CC BY-SA license. 


Corresponding Author: 


Arunadevi Nakkiran 

Department of Computer Science, Periyar University 
Salem, Tamilnadu, India 

Email: haseenaa@gmail.com 


1. INTRODUCTION 

In the era of big data, organizations have started to use big data stream computing, as real-time big
data brings both advantages and risks. Big data stream computing has hence evolved as a means in several
applications, including social networks, trading, video surveillance, and fraud identification. Several
research works have addressed both its opportunities and its challenges. An elastic online scheduling
framework for big data streaming applications (E-Stream) was proposed by Sun et al. [1] with the objective of
reducing the system response time and improving application fairness, but less focus was placed on predictive
scheduling accuracy. A novel predictive scheduling framework was designed by Li et al. [2] with the objective
of ensuring fast stream data processing. However, the time complexity of predictive scheduling remained
unsolved. Given the significant nature of big data and big data analytics, a critical analysis of big data
challenges was presented by Sivarajah et al. [3]. Fernandes et al. [4] presented the findings of an analysis
from a metallurgic company. A multivariate Gaussian function was used by Toit [5] to monitor critical
variables. A fast and efficient distributed stream processing framework was presented by Choi et al. [6].
Bowden et al. [7] presented the design and prototype implementation of a system enabling predictive
maintenance of industrial equipment. A taxonomy, review, and future directions related to scheduling in
distributed stream processing systems were presented by Liu and Buyya [8]. A resource aware dynamic data
stream model was designed by Gautam et al. [9]. Yet another work by Usama et al. [10] focused on the problems
and challenges in big data.

To overcome the above issues, that is, to achieve higher predictive scheduling accuracy, lower time
complexity, and a lower false positive rate, this paper introduces an efficient technique called elastic-net
kernelized multivariate discriminant MapReduce classification (EKMDMC). The novel contributions of the
proposed method include the following.

- For pre-processing, elastic-net regularization, a regression method based on the least absolute
shrinkage and selection operator (LASSO) predictive model, is introduced. It is applied not only to
accurately estimate the relationship between the dependent variable and one or more independent
variables, but also to avoid overfitting of the model on the training data.

- For classification, the kernelized multivariate Fisher discriminant MapReduce classifier (KMFDMC) is
presented to reduce scheduling time while using minimum resources. Resource efficient processing unit
prediction is performed via KMFDMC with relevant features such as CPU time, bandwidth utilization,
memory consumption, and energy for scheduling tasks.

- The proposed technique is implemented in Python and evaluated in terms of resource aware predictive
scheduling efficiency, false positive rate, scheduling time, and memory consumption for a varying number
of stream data tasks.

The rest of the paper is organized as follows. Related work is presented in section 2. The design and
implementation details of the proposed technique are presented in section 3. A detailed discussion of the
results is presented in section 4, and section 5 concludes the paper.


2. RESEARCH METHOD 

Dehkordi and Zamanifar [11] designed a deadline aware scheduling framework for minimizing latency
and utilization cost. A graphics processing unit (GPU) enabled online stream data processing system was
designed by Chen et al. [12]. A modified first-fit based run time aware data stream scheduling strategy was
designed by Sun et al. [13]. A dual channel pipeline parallel data processing model was designed in [14]. Yet
another method, based on a double level hybrid genetic algorithm and ant colony optimization, was presented
by Xu et al. [15] to address the dynamic simultaneous scheduling problem. A relatively novel intelligent
model was designed in [16]. A review of the complexity of managing big data was presented in [17]. However,
the scheduling process was not considered. To address this issue, Gil et al. [18] presented a flexible
resource-constrained project scheduling problem with competency differences. A comprehensive approach based
on novel deep learning models was presented in [19]. A novel priority-aware streaming media multipath data
scheduler mechanism was designed in [20] for multimedia multipathing services. Integrated support for
similarity queries in a parallel big data management system was introduced in [21]. A Hadoop based query
system called Haery was developed in [22] to process high dimensional data. A machine learning approach was
introduced in [23] for reality awareness and optimization in the cloud. Two decision tree classification
methods were introduced in [24] to automatically find priority rules for solving the resource constrained
project scheduling problem (RCPSP). A centralized 3-dimensional radio resource (namely, time, frequency, and
power) allocation and scheduling approach was presented in [25]-[27] for control-plane and user-plane
(C-/U-plane) separation architectures in fifth generation mobile networks.


3. RESULTS AND DISCUSSION 

A big data stream refers to a huge amount of data that arrives as a continuous stream and is
processed to extract real-time insights. Such voluminous data appears in different formats that cannot be
processed with traditional methods. In this work, elastic-net kernelized multivariate discriminant MapReduce
classification (EKMDMC) is presented to perform feature selection and resource aware predictive scheduling
for big data streams.

Figure 1 shows the architecture diagram of the proposed EKMDMC technique. The input is obtained
from the big dataset ‘Dj’. Consider ‘m’ processing units ‘p1, p2, p3, ... pm’ that process ‘n’ data streams
‘sd1, sd2, sd3, ... sdn’. Initially, elastic-net regularization is applied to perform feature selection, that
is, to select relevant features such as central processing unit (CPU) time, bandwidth, memory, and energy.
After the relevant features are selected, the resource efficient processing unit is determined by applying
the kernelized multivariate Fisher discriminant MapReduce classifier. Finally, stream data task scheduling
is carried out with higher accuracy. The different processes involved in the design of the EKMDMC technique
are described in the following sections.

Figure 1. Architecture diagram of proposed EKMDMC technique


3.1. Elastic-net regularization based feature selection 

Feature selection, the first process, is performed by applying elastic-net regularization.
Elastic-net regularization is a regression method applied to limit overfitting on the training data. The
EKMDMC technique uses the elastic-net regularization method to solve this issue, building on the LASSO
predictive model. Here, a single variable is selected from a group of highly relevant variables, and highly
irrelevant variables are rejected. This is performed by adding a regularization term to the given equation.
Besides, elastic-net regularization is applied for both parameter estimation (i.e. prediction of average
data processing time) and feature selection, where the more relevant features are selected among a group of
features for performing predictive scheduling. The elastic-net method is defined as (1),


$$\beta = \arg\min_{\alpha} \left( \lVert y - \alpha F \rVert^{2} + P_2 \lVert \alpha \rVert^{2} + P_1 \lVert \alpha \rVert \right) \tag{1}$$


In (1), ‘F’ represents the feature set ‘{F1, F2, F3, ... Fn}’, ‘‖α‖’ is the regularization term, and
‘P1’ and ‘P2’ are parameters with values between ‘0’ and ‘1’ that control the importance of the
regularization terms. The predicted output ‘y’ is determined using the regression coefficient ‘β’. The
regression coefficient returns zero for irrelevant features and one for relevant features. Using (1),
relevant features such as central processing unit (CPU) time, memory, bandwidth, and energy are selected for
predicting the average processing time of stream data. CPU time refers to the time consumed in accomplishing
a task, as shown in (2).


$$t_{cpu} = t_{ct}(sd_i) \tag{2}$$


In (2), the CPU time of the processing unit ‘tcpu’ is the task completion time ‘tct’ of the stream
data task ‘sdi’. Another main characteristic of a processing unit is memory, which refers to the storage
space utilized by the processing unit to complete a certain task. The memory utilization of the processing
unit is expressed as (3).


$$m_{ut} = m_t - m_{ud} \tag{3}$$

In (3), the memory utilization of the processing unit ‘mut’ is the difference between the total
memory ‘mt’ and the unused space of the processing unit ‘mud’. Besides CPU time and memory utilization, the
energy consumption of the processing unit is considered for processing stream data. The energy consumption
‘Ec’ refers to the difference between the total energy ‘ET’ and the remaining energy ‘ER’ of the processing
unit and is given as (4).

$$E_c = E_T - E_R \tag{4}$$
Finally, bandwidth utilization ‘bwu’ is the average data transfer rate of the processing unit, that
is, the difference between the available bandwidth ‘bwt’ and the unused bandwidth ‘bwud’. The bandwidth
utilization of the processing unit is computed as shown in (5).

$$bw_u = bw_t - bw_{ud} \tag{5}$$

The elastic-net regression method predicts the processing time of each unit. Stream data tasks with
lower CPU time and lower task size (i.e. memory) take less processing time. The elastic-net regression
method increases the scheduling accuracy by selecting the relevant features of the processing unit.
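
The selection step in (1) can be illustrated with a minimal sketch. It assumes a plain coordinate-descent solver written in pure Python, and it treats every feature with a nonzero coefficient as selected. The feature values, the `noise` column, the target values, and the settings `p1` and `p2` are all hypothetical and are not taken from the experiments in this paper.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values inside [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net(F, y, p1, p2, n_iter=50):
    """Minimize ||y - F a||^2 + p2 * ||a||^2 + p1 * ||a||_1 by coordinate descent."""
    n, d = len(F), len(F[0])
    alpha = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            corr, norm = 0.0, 0.0
            for i in range(n):
                # Prediction from every feature except j (partial residual).
                others = sum(F[i][k] * alpha[k] for k in range(d) if k != j)
                corr += F[i][j] * (y[i] - others)
                norm += F[i][j] ** 2
            # Closed-form minimizer of the objective along coordinate j.
            alpha[j] = soft_threshold(corr, p1 / 2.0) / (norm + p2)
    return alpha


# Hypothetical standardized features, one row per observation. The third
# column is deliberately unrelated to the target.
feature_names = ["cpu_time", "memory", "noise"]
F = [
    [1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
# Hypothetical centered processing times, built as 3 * cpu_time + 2 * memory.
y = [5.0, 1.0, -1.0, -5.0]

alpha = elastic_net(F, y, p1=2.0, p2=1.0)
selected = [name for name, a in zip(feature_names, alpha) if a != 0.0]
print(selected)
```

In this toy setting the L1 term drives the coefficient of the unrelated column to exactly zero, while the L2 term only shrinks the remaining coefficients. This is the behaviour the feature selection step relies on.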


3.2. Kernelized multivariate fisher discriminant mapreduce classifier 

After the relevant features are selected, resource efficient processing unit prediction is
performed via the kernelized multivariate Fisher discriminant MapReduce classifier (KMFDMC) with the relevant
features. The MapReduce function includes two phases, namely the map phase and the reduce phase. Here, the
stream data are mapped to the appropriate processing unit using the kernelized multivariate Fisher
discriminant with the relevant features. Next, a summary operation is carried out to provide the final output
results. Figure 2 illustrates KMFDMC for efficient prediction as well as resource aware scheduling. Let us
consider a number of stream data tasks ‘sd1, sd2, sd3, ... sdn’ as input. Initially, a number of classes
(i.e. processing units) ‘p1, p2, p3, ... pm’ are initialized.

Figure 2. Flow process of kernelized multivariate Fisher discriminant MapReduce classifier (KMFDMC)


The map phase takes an input and transforms its content into key-value pairs, in which the key is
formed by a distinctive combination of keywords. Based on the combination of keywords, the kernelized
multivariate Fisher discriminant predicts the processing unit and performs the scheduling process. The
KMFDMC uses discriminant vector maps to map different incoming stream data into different classes. Fisher's
separation function, which is the ratio of the variance between classes to the variance within a class, is
defined as shown in (6).


$$S_p = \frac{\sigma_b}{\sigma_w} = \frac{w \, s_r(b) \, d}{w \, s_r(w) \, d} \tag{6}$$


In (6), the separation function ‘Sp’ refers to the ratio of the variance between classes ‘σb’ to the
variance within the class ‘σw’. It is obtained by mapping the data into classes with a linear discriminant
vector ‘w’ along the optimal projection direction ‘d’, with the aid of the between-class scatter matrix
‘sr(b)’ and the within-class scatter matrix ‘sr(w)’. The scatter matrix is applied to find whether the
processing unit is suitable to handle the stream data. Besides the scatter matrix, a mean value is
initialized for each class. In KMFDMC, a kernel function finds the similarity between the mean of the class
and the stream data tasks. Here, the distance similarity is measured to find the resource efficient
processing unit.


$$k(sd_i, \mu_j) = \lVert sd_i - \mu_j \rVert^{2} \tag{7}$$

In (7), the distance similarity is obtained via the kernel function ‘k(sdi, μj)’. With the resultant
value, Fisher discriminant analysis identifies the minimum distance between the stream data task ‘sdi’ and
the mean of each class ‘μj’ (i.e. processing unit).


$$f(x) = \arg\min_{j} \lVert sd_i - \mu_j \rVert^{2} \tag{8}$$


In (8), ‘f(x)’ denotes the output of Fisher discriminant analysis and ‘arg min’ denotes the argument
of the minimum function. The minimum distance represents the highest similarity between the class mean and
the stream data task. It means that the specific processing unit is suitable for completing that stream data
task with less resource utilization. Fisher discriminant analysis predicts an efficient processing unit for
all incoming tasks. After the resource efficient processing unit is predicted, the stream data tasks are
scheduled to the corresponding unit.
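
The prediction rule in (7) and (8) amounts to a nearest-class-mean assignment, which can be sketched as follows. The feature order (CPU time, memory, bandwidth, energy), the class means, and the task vector are hypothetical normalized values chosen only for illustration.

```python
def kernel_distance(task, mean):
    """Squared Euclidean distance between a task and a class mean, as in (7)."""
    return sum((t - m) ** 2 for t, m in zip(task, mean))


def predict_unit(task, class_means):
    """Return the processing unit whose class mean is closest to the task, as in (8)."""
    return min(class_means, key=lambda unit: kernel_distance(task, class_means[unit]))


# Hypothetical class means per processing unit, in the assumed feature order
# (cpu_time, memory, bandwidth, energy), all scaled to [0, 1].
class_means = {
    "p1": (0.2, 0.3, 0.1, 0.2),
    "p2": (0.8, 0.7, 0.9, 0.6),
}
task = (0.25, 0.35, 0.2, 0.1)

print(predict_unit(task, class_means))
```

A smaller distance is read as a higher similarity, so the task is assigned to the unit with the minimum value of the kernel function.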


Algorithm 1 Elastic-Net Kernelized Multivariate Discriminant MapReduce Classification


Input: Stream data tasks sd1, sd2, sd3, ... sdn; processing units p1, p2, p3, ... pm
Output: Improved resource aware predictive scheduling efficiency
Begin
    \\ feature selection
    1.  Apply elastic-net regression (1) to select the features central processing unit (CPU)
        time, memory, bandwidth, energy
    2.  For each processing unit pj
    3.      Calculate tcpu, mut, Ec, bwu
    4.  End for
    \\ prediction and scheduling
    5.  Initialize number of classes cj
    6.  Define class separability function Sp
    7.  Define the mean of the class μj
    8.  For each stream data task sdi
    9.      For each mean of the class μj
    10.         Measure similarity k(sdi, μj)
    11.         Find minimum distance arg min ||sdi − μj||²
    12.         Predict resource efficient pj
    13.         Schedule sdi to pj
    14.     End for
    15. End for
End


Algorithm 1 describes the Elastic-Net kernelized multivariate discriminant MapReduce 
classification (EKMDMC) to improve scheduling efficiency by utilizing minimum resources. 
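
The prediction and scheduling part of Algorithm 1 can be sketched as a map phase followed by a reduce phase. This is a single-process illustration under stated assumptions, not a distributed MapReduce job: the task identifiers, feature vectors, and class means are hypothetical, and the key of each pair is simply the predicted processing unit.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance used as the similarity measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(tasks, class_means):
    """Emit one (processing unit, task id) pair per stream data task."""
    for task_id, features in tasks.items():
        unit = min(class_means, key=lambda u: squared_distance(features, class_means[u]))
        yield unit, task_id


def reduce_phase(pairs):
    """Group task ids by processing unit to form the final schedule."""
    schedule = defaultdict(list)
    for unit, task_id in pairs:
        schedule[unit].append(task_id)
    return dict(schedule)


# Hypothetical class means and stream data tasks over two selected features.
class_means = {"p1": (0.2, 0.2), "p2": (0.8, 0.8)}
tasks = {"sd1": (0.1, 0.3), "sd2": (0.9, 0.7), "sd3": (0.3, 0.1)}

schedule = reduce_phase(map_phase(tasks, class_means))
print(schedule)
```

The reduce phase here plays the role of the summary operation: it collects the per-task predictions into one schedule per processing unit.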


4. FIGURES AND TABLES 

The experimental evaluation is performed with the epileptic seizure recognition dataset
(https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition). The epileptic seizure recognition
dataset is a high dimensional dataset that includes 11500 instances and 179 attributes. The dataset
characteristics are multivariate and time series, and the associated tasks are classification and
clustering. Comparative analysis and discussion are made on four parameters, namely resource aware
predictive scheduling efficiency, false positive rate, scheduling time, and memory consumption, with respect
to the number of stream data tasks. The experimental configuration is shown in Table 1.


Table 1. Tabulation for experimental configuration


Requirements        Specification
Software            Python 3.5
Processor           Intel i3-4130 3.40 GHz
RAM                 2 GB and above
Operating System    Windows 7, 10


4.1. Impact of resource aware predictive scheduling efficiency 
Resource aware predictive scheduling efficiency measures the percentage ratio of the number of tasks
correctly scheduled to the resource aware optimized processing unit ‘RAOPUj’ to the number of data tasks
‘N’, as shown in (9).

$$RPSE = \frac{RAOPU_j}{N} \times 100 \tag{9}$$


In (9), resource aware predictive scheduling efficiency ‘RPSE’ is measured in percentage. Table 2
lists the resource aware scheduling efficiency of four methods, EKMDMC, E-Stream [1], the predictive
scheduling framework [2] and 3D radio resource scheduling [25], [26]. As the number of data tasks increases
from 100 to 1000, the efficiency first shows a decreasing trend and then an increasing trend. Hence, the
resource aware scheduling efficiency is neither inversely nor directly proportional to the number of tasks.
With 100 data tasks considered for experimentation, the resource aware predictive scheduling efficiency
using EKMDMC, E-Stream [1], and the predictive scheduling framework [2] was ‘83.71%’, ‘77.57%’ and
‘70.42%’, respectively.


Table 2. Tabulation for resource aware scheduling efficiency


Number of     Resource aware scheduling efficiency (%)
data tasks    EKMDMC    E-Stream    Predictive scheduling framework    3D radio
100           83.71     77.57       70.42                              63.38
200           79.25     76.35       67.55                              56.75
300           78.55     74.25       63.35                              52.45
400           76.35     71.15       61.15                              51.15
500           74.25     67.55       64.55                              61.55
600           77.15     64.35       60.35                              56.35
700           78.35     67.25       63.33                              59.41
800           80.45     69.35       61.55                              53.75
900           78.15     71.15       64.55                              57.95
1000          77.55     69.45       67.35                              65.25


Figure 3 shows the comparison of resource aware predictive scheduling efficiency. The efficiency of
EKMDMC is improved due to the application of the elastic-net kernelized multivariate discriminant MapReduce
classification algorithm. The resource aware scheduling efficiency using EKMDMC is improved by 10% compared
to [1], 21% compared to [2] and 36% compared to [25], [26].

Figure 3. Comparison of resource aware predictive scheduling efficiency


4.2. Impact of false positive rate 
False positive rate refers to the percentage ratio of the number of tasks incorrectly scheduled to
the resource aware processing unit ‘RAOPUIncorrect’ to the number of data tasks ‘N’, as shown in (10).


$$FPR = \frac{RAOPU_{Incorrect}}{N} \times 100 \tag{10}$$


In (10), false positive rate (FPR) is measured in percentage (%). A lower false positive rate
indicates a more efficient method. Table 3 lists the false positive rate of four methods, EKMDMC, E-Stream
[1], the predictive scheduling framework [2] and 3D radio resource scheduling [25], [26]. As the number of
data tasks increases from 100 to 1000, the false positive rate increases for all methods. With 100 data
tasks, the FPR using EKMDMC, E-Stream [1], and the predictive scheduling framework [2] was ‘13.28%’,
‘20.42%’ and ‘27.57%’, respectively.
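
The two accuracy measures in (9) and (10) can be sketched as below. The sketch assumes that a reference (optimal) processing unit is known for every task, and that a task counts as correctly scheduled when the predicted unit matches that reference. The label lists are hypothetical. Under this simplification every task is either correct or incorrect, so the two values sum to 100, which the reported tables do not require.

```python
def rpse(predicted, reference):
    """Percentage of tasks scheduled to the reference processing unit, as in (9)."""
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return 100.0 * correct / len(reference)


def fpr(predicted, reference):
    """Percentage of tasks scheduled to a different processing unit, as in (10)."""
    incorrect = sum(1 for p, r in zip(predicted, reference) if p != r)
    return 100.0 * incorrect / len(reference)


# Hypothetical predicted and reference processing units for N = 5 tasks.
predicted = ["p1", "p2", "p1", "p3", "p2"]
reference = ["p1", "p2", "p1", "p3", "p1"]

print(rpse(predicted, reference), fpr(predicted, reference))
```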


Table 3. Tabulation for false positive rate


Number of     False positive rate (%)
data tasks    EKMDMC    E-Stream    Predictive scheduling framework    3D radio
100           13.28     20.42       27.57                              34.72
200           14.35     22.25       28.55                              34.85
300           16.2      24.55       30.35                              36.15
400           18.35     28.15       32.25                              36.35
500           20.45     31.35       34.55                              37.75
600           22.55     34.55       38.15                              42.25
700           29.25     36.25       40.25                              44.25
800           32.15     38.35       42.25                              46.15
900           34.44     40.25       43.55                              47.85
1000          38.25     42.35       44.55                              48.75


Figure 4 shows the false positive rate for up to 1000 data tasks. With the increase in the number of
data tasks, the FPR shows an increasing trend. With 100 data tasks, the FPR using EKMDMC, E-Stream [1] and
the predictive scheduling framework [2] was ‘13.28%’, ‘20.42%’ and ‘27.57%’, respectively. From this, it is
inferred that the false positive rate is reduced in EKMDMC. This is because of the application of the
kernelized multivariate Fisher discriminant. The FPR using EKMDMC is reduced by 26% compared to [1], 35%
compared to [2] and 43% compared to [25], [26].

Figure 4. Graphical representations for false positive rate
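
The paper does not state how the summary percentages are derived from the tables. One plausible reading, taken here purely as an assumption, is the mean of the per-row relative reductions. The sketch below applies that reading to the EKMDMC and E-Stream columns of Table 3, and the result comes out close to the reported 26% reduction compared to [1].

```python
# False positive rates (%) copied from Table 3 for 100 to 1000 data tasks.
ekmdmc = [13.28, 14.35, 16.2, 18.35, 20.45, 22.55, 29.25, 32.15, 34.44, 38.25]
e_stream = [20.42, 22.25, 24.55, 28.15, 31.35, 34.55, 36.25, 38.35, 40.25, 42.35]


def mean_relative_reduction(proposed, baseline):
    """Average, over all rows, of the percentage by which proposed is below baseline."""
    reductions = [(b - p) / b * 100.0 for p, b in zip(proposed, baseline)]
    return sum(reductions) / len(reductions)


reduction = mean_relative_reduction(ekmdmc, e_stream)
print(f"{reduction:.1f}%")
```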


4.3. Impact of scheduling time
Scheduling time refers to the time consumed in resource aware scheduling, based on the time taken for
a single data task. It is formulated as shown in (11).


ST = N * Time[RAPS] (11) 


In (11), scheduling time ‘ST’ is measured based on the number of data tasks given as input ‘N’ and
the time consumed in scheduling a single data task in a resource aware manner ‘Time[RAPS]’. It is measured
in milliseconds (ms). Table 4 lists the scheduling time measured for up to 1000 data tasks. With the
increase in the number of data tasks, the time consumed in scheduling also increases due to the increase in
the size of data tasks. With 100 data tasks, the overall scheduling time using EKMDMC, E-Stream [1] and the
predictive scheduling framework [2] was observed to be ‘0.148 ms’, ‘0.242 ms’ and ‘0.27 ms’, respectively.

Figure 5 shows the graphical representation of scheduling time. From this analysis, the scheduling
time using EKMDMC is lower than that of [1] and [2]. This is because of the application of elastic-net
regularization. The scheduling time using EKMDMC is lower by 30%, 47% and 52% compared to [1], [2] and
[25], [26], respectively.
Table 4. Tabulation for scheduling time


Number of     Scheduling time (ms)
data tasks    EKMDMC    E-Stream    Predictive scheduling framework    3D radio
100           0.148     0.242       0.27                               0.35
200           0.185     0.255       0.345                              0.435
300           0.203     0.305       0.405                              0.505
400           0.275     0.325       0.475                              0.525
500           0.305     0.375       0.515                              0.545
600           0.31      0.405       0.575                              0.605
700           0.318     0.465       0.625                              0.658
800           0.325     0.525       0.676                              0.694
900           0.341     0.574       0.705                              0.715
1000          0.375     0.604       0.725                              0.760

Figure 5. Graphical representations for scheduling time


4.4. Impact of memory consumption

Memory consumption refers to the memory consumed in resource aware scheduling, based on the memory
used for a single data task, as shown in (12).


MC = N * Space[RAPS] (12)


In (12), memory consumption ‘MC’ is measured based on the number of data tasks given as input ‘N’ and
the memory consumed in scheduling a single data task in a resource aware manner ‘Space[RAPS]’. It is
measured in kilobytes (KB). Table 5 lists the memory consumption measured for up to 1000 data tasks. With
the increase in the number of data tasks, the memory consumed in scheduling also increases due to the
increase in the size of data tasks. Hence, from the table it is inferred that memory consumption increases
with the number of data tasks. With 100 data tasks, the overall memory consumption using EKMDMC, E-Stream
[1] and the predictive scheduling framework [2] was observed to be ‘69 KB’, ‘97 KB’ and ‘124 KB’,
respectively.
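
The cost models in (11) and (12) can be sketched as follows. The sketch assumes that the per-task time and memory are estimated from a small sample and then scaled linearly by ‘N’. The function `schedule_one` is a hypothetical stand-in for scheduling a single task and is not the scheduler evaluated in this paper, so the measured numbers depend entirely on the machine and are not comparable to Tables 4 and 5.

```python
import time
import tracemalloc


def schedule_one(task):
    """Hypothetical stand-in for scheduling one task: distance to a fixed class mean."""
    mean = (0.5, 0.5, 0.5, 0.5)
    return sum((t - m) ** 2 for t, m in zip(task, mean))


def time_per_task_ms(sample_tasks):
    """Average wall-clock time in milliseconds to schedule one task from the sample."""
    start = time.perf_counter()
    for task in sample_tasks:
        schedule_one(task)
    return (time.perf_counter() - start) * 1000.0 / len(sample_tasks)


def space_per_task_kb(task):
    """Peak memory in kilobytes allocated while scheduling a single task."""
    tracemalloc.start()
    schedule_one(task)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak_bytes / 1024.0


def scheduling_time(n, time_raps_ms):
    """ST = N * Time[RAPS], as in (11)."""
    return n * time_raps_ms


def memory_consumption(n, space_raps_kb):
    """MC = N * Space[RAPS], as in (12)."""
    return n * space_raps_kb


sample = [(0.1 * i, 0.2, 0.3, 0.4) for i in range(10)]
st = scheduling_time(1000, time_per_task_ms(sample))
mc = memory_consumption(1000, space_per_task_kb(sample[0]))
print(f"ST = {st:.3f} ms, MC = {mc:.1f} KB")
```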


Table 5. Tabulation for memory consumption


Number of     Memory consumption (KB)
data tasks    EKMDMC    E-Stream    Predictive scheduling framework    3D radio
100           69        97          124                                151
200           110       133         164                                178
300           113       154         188                                203
400           117       183         214                                220
500           122       194         223                                29
600           130       213         239                                27
700           143       224         254                                29
800           154       233         284                                26
900           173       254         288                                310
1000          188       288         319                                328


Figure 6 shows the memory consumed in scheduling the data tasks. The figure illustrates that memory
consumption is reduced using the EKMDMC technique as compared to [1] and [2]. This is because of the
application of the regression method. The memory consumed in resource aware predictive scheduling using the
EKMDMC technique is reduced by 32% compared to [1], 42% compared to [2] and 45% compared to [25], [26].

Figure 6. Graphical representations for memory consumption


5. CONCLUSION 

This paper presents the elastic-net kernelized multivariate discriminant MapReduce classification
(EKMDMC) technique, which is a resource aware predictive scheduler for big stream data. It selects the more
relevant features using elastic-net regularization and applies KMFDMC to perform resource aware predictive
scheduling and to ensure that incoming stream data tasks are scheduled. The EKMDMC technique reduces the
false positive rate by utilizing Fisher discriminant analysis. Simulation results show that the EKMDMC
technique provides better performance in terms of false positive rate, scheduling time, memory consumption,
and resource aware predictive scheduling efficiency. However, the EKMDMC technique considers only a limited
number of stream data tasks. In future work, the number of data tasks will be further increased to evaluate
the performance of the EKMDMC technique. Future work can also extend the EKMDMC technique to solve the
multi-mode resource constrained project scheduling problem (MRCPSP).